Enhancing genetic association power in endometriosis through unsupervised clustering of clinical subtypes identified from electronic health records

Endometriosis is a complex and heterogeneous condition affecting 10% of reproductive-age women, and yet, it often goes undiagnosed for several years. Limited observed heritability (7%) of large genetic association studies may be attributable to underlying heterogeneity of disease mechanisms. Therefore, we conducted this study to investigate genetic associations across sub-phenotypes of endometriosis. We performed unsupervised clustering of 4,078 women with endometriosis based on known endometriosis risk factors, symptoms, and concomitant conditions. The clusters were characterized by examining electronic health record (EHR) data and comprehensive chart reviews. We then performed genetic association for each cluster with 39 endometriosis-associated loci (Total Nendometriosis cases = 12,350). We identified five sub-phenotype clusters: (1) pain comorbidities, (2) uterine disorders, (3) pregnancy complications, (4) cardiometabolic comorbidities, and (5) HER-asymptomatic. Bonferroni significant loci included PDLIM5 for the cluster 1, GREB1 for cluster 2, WNT4 for cluster 3, RNLS for cluster 4, and ABO for cluster 5. The difference in associations between the groups suggests complex and varied genetic mechanisms of endometriosis and its symptoms. This study enhances our understanding of the clinical patterns of endometriosis sub-phenotypes, showcasing the innovative approach employed to investigate this complex disease.


Introduction
Endometriosis, a complex gynecological condition affects 10% of women of reproductive age globally and more than 50% of women with infertility 1 , yet it often goes either undiagnosed or misdiagnosed, leading to delayed diagnoses and delivery of effective therapy 2,3 .Endometriosis is characterized by a variety of symptoms that can significantly impact the quality of life of affected individuals.While the most common symptom of endometriosis is pelvic pain, endometriosis is formally characterized by the presence of endometrial-like tissue outside of the uterus via surgical procedure.Importantly, endometriosis is not just a disease of pelvis, but instead a chronic systemic disease affecting the entire human body 4 .Emerging evidence suggests that endometriosis causes changes in metabolism and gene expression in the central nervous system, leading to low body mass index (BMI) and neurological depression symptoms 5 .The extent of disease observed during surgery or via imaging does not necessarily correlate with reported symptoms (or vice versa).Instead, complex pathophysiology contributes to differences in the clinical and biological manifestations of endometriosis.The diagnosis of endometriosis relies heavily on clinical presentation, and traditional diagnostic strategies consider only patients who present with typical symptoms 6 , overlooking women with atypical or distant manifestations.Understanding and quantifying this pathological heterogeneity is critical to design noninvasive, precise diagnostic tools that will enable accurate diagnoses and personalized treatment options.Electronic Health Records (EHRs) represent a rich, yet underutilized, data source for capturing the phenotypic spectrum of endometriosis 7 .Although the symptoms for endometriosis can be quite severe, including chronic debilitating pain, dyspareunia, and infertility, the average time to diagnosis is 4.5 years 8 , in part because the only way to definitively diagnose endometriosis is by surgical observation of endometrial lesions growing outside of the uterus (e.g.abdominal cavity, pelvis, ovaries, etc.) 9 .The variability in symptoms and disease presentation adds to the difficulty of diagnosis and hinders the optimal use of electronic health records (EHRs) in research for accurately identifying affected individuals and control subjects [10][11][12] , which is critical for understanding the disease and advancing treatment strategies.The depth and breadth of EHR data provide a unique opportunity to apply unsupervised learning techniques for the identification of distinct phenotypic clusters that may correspond to clinical subtypes of endometriosis.
Despite the prevalence and severity of endometriosis, etiology of endometriosis is still poorly understood.The pursuit thus far of biomarkers and drug targets based on genetic contributions of disease in patients with endometriosis has mainly included genome-wide association studies to identify genetic variants contributing to the disease 13,14 .Twin studies have estimated the heritability of endometriosis to be 47.5% 15 , and common variants are estimated to contributed 26% of phenotypic variance 16 , but the largest GWAS to-date (N > 750,000, 60,674 cases) has only explained 7% of the phenotypic variance 14 .Although large-scale genomic studies have promised insights into the underlying genetic mechanisms of endometriosis, yet the heterogeneity of the disease presentation has consistently complicated these efforts.Traditional genetic association studies have struggled to untangle the intricate web of genotypic and phenotypic diversity within endometriosis patients, leading to a critical need for innovative approaches to dissect the disease's complexity.
The complex nature of endometriosis, with its diverse symptoms and overlapping features with other gynecological diseases, presents challenges for understanding its genetic mechanisms.
We hypothesized that underlying clinical heterogeneity is obscuring the genetic mechanisms and preventing large-scale genetic studies from explaining more of the heritability.Building on the premise that a more nuanced understanding of endometriosis subtypes could unlock new genetic associations, our study leverages unsupervised, phenotypic clustering analysis of EHR data to systematically identify and characterize clinical subtypes of endometriosis.By dissecting the heterogeneity inherent in the disease, we aim to increase the power of genetic association analyses, facilitating the identification of subtype-specific disease mechanisms.This approach not only promises to enhance our understanding of endometriosis genetics but also to refine diagnostic criteria and inform more targeted and effective treatment strategies.

Derivation, study, and validation datasets
This study utilized six total EHR-linked biobank datasets to investigate the genetic mechanisms underlying endometriosis and its sub-phenotypes.The datasets used were endometriosis cases in the non-genotyped PMBB for the derivation of clusters, a chart-reviewed endometriosis cohort to help characterize the clusters, the genotyped PMBB, six sites within the eMERGE network, AOU, UKBB, and BioVU for genome-wide association analyses (See Methods).The sample sizes for each cohort, the numbers of cases and controls, and the mean age at the time of data pull for each cohort are shown in Table 1.See Methods for details on each of the four datasets.By leveraging these datasets, the study aimed to identify endometriosis sub-phenotypes and gain insights into their associated genetic factors   Based on these tests, we first eliminated DBSCAN because the inferred number of clusters was 131, a far too complex model to be useful or interpretable.Next, we eliminated hierarchical clustering because the sizes of the resulting clusters were more uneven than the other methods.

Derivation of unsupervised of clusters
Spectral clustering and k-means clustering were ultimately more difficult to choose between, but when we focused on the shapes of the distortion curves across the values of K, we observed that kmeans lacked an "elbow" to show a clear optimal K value whereas spectral clustering clearly indicated 5 as an ideal K with a local minimum.Thus, we chose spectral clustering with K=5 as our unsupervised subtyping model.The sizes of the final clusters were: (1) 441 -11%, (2) 686 -17%, (3) 1,151-28%, (4) 796 -20%, and (5) 1,004 -25%.Figure 2 illustrates the eigenvectors of the affinity matrix which were used for clustering the data points.

Data-driven cluster characterization
After clustering, we aimed to characterize these clusters by observing patterns in clinical presentation (prevalence) amongst the input features.We performed two sets of z-score proportion tests comparing prevalence of each feature between each cluster and the other four clusters in our training set.The first set of tests was performed on the original cluster derivation cohort, and the features included were the 17 input features (symptoms and comorbidities with prevalence > 5%) as well as ICD-defined anatomical subtypes of endometriosis (Figure 3).Among the five clusters identified in the training set, there were many input features and ICDbased anatomical subtypes with significantly different proportions.To identify distinguishing features between the clusters, we focus on phenotypes which were significantly enriched and had the highest prevalence in that cluster.Cluster one had the highest rates of (and was significantly enriched for) dysuria (Z=8.9),migraine (Z=10.6),IBS (Z=10.3),fibromyalgia (Z=15.3),asthma (10.3), abdominal pelvic pain (Z=13.6), and shortness of breath (Z=13.5).Cluster two had the highest rates of the following significantly enriched traits: dysmenorrhea (Z=21.9),infertility (Z=5.9),irregular menstruation (Z=31.75),leiomyoma of uterus (Z=21.9),and uterine endometriosis defined by ICD-9 617.0* or ICD-10 N80.0* (Z=13.4).Cluster three's defining features were high risk pregnancy supervision (Z=7.1),superficial lesions defined by ICD-9 617.3* or ICD-10 N80.3* (Z=7.1), and lower abdominal pain (Z=14.6).Individuals in cluster four had highest prevalence of abnormal cholesterol (Z=33.1)and hypertension (Z=33.9),while cluster five was only enriched for unspecified endometriosis defined as ICD-9 617.9* or ICD-10 N80.9* (Z=7.0).
The second set of tests was performed on a subset of endometriosis cases (N=682) from the genotyped PMBB for whom chart reviews were performed by OB-GYN clinical fellows at the University of Pennsylvania Hospital System.The features tested were gold standard confirmed diagnoses (endometriosis, adenomyosis, fibroids, and any ICD false positives), surgical subtypes, hormone use at the time of confirmation procedure, and symptoms identified from a combination of structured data and notes (Figure 4).Because the size of our chart-reviewed dataset was limited, there were fewer significant tests.

Candidate gene association testing stratified by clusters
We applied the subtype classifications observed in our derivation set to our five genetic association datasets, PMBB, eMERGE, AOU, UKBB, and BioVU.We used a K-nearest neighbors model with k=3 to assign endometriosis cases to the five phenotypes based on the same 17 EHR-derived features (Table 2).
Table 2. Clusters by dataset.Counts and proportions of endometriosis cases in each cluster by dataset, along with the faction of cases for each subtype.
The smallest cluster was the pain comorbidities cluster, with only 14.6% of total endometriosis cases being assigned to this cluster.The EHR-asymptomatic cluster was the largest cluster overall (28.0% of cases).The other three clusters occurred in relatively even proportions in the overall meta-analysis group at 18.5% (uterine disorders), 20.0% (pregnancy complications), and 18.9% (cardiometabolic comorbidities).
To establish a reference for the expected level of signal replication, we began with a positive control test.We conducted association tests on 39 established genetic locations (autosomes only) known to be linked to endometriosis (Figure 5).To test whether stratifying by clinical presentation allowed for greater resolution in genetic associations, we performed case-control candidate gene association studies for the five phenotypic clusters by meta-analyzing ancestry-stratified summary statistics from five EHR-linked genetic datasets: PMBB, eMERGE, AOU, UKBB, and BioVU.We observe 17 / 39 loci (44%) significantly associating with one or more clusters (Figure 6a,b).Also, for 20 loci, one or more cluster phenotypes yield stronger associations than the positive control despite having smaller sample sizes (Figure 6c).Some loci were significantly replicated in the negative control, indicated by Figure 6d.shows any loci which were significantly associated with the randomly-assigned negative control phenotypes.

Dataset
The smallest cluster, cluster one, with high rates of pain comorbidities, was significantly associated with one known locus, PDLIM5, and it was more significantly associated than the positive control for eight loci, as shown in Fig 6c .The uterine disorders cluster (two) was significantly associated with six loci, WNT4, GREB1, BSN, SYNE1, GDAP1, and ASTN2.Out of the five loci significantly associated with the pregnancy complications cluster (three), two of them were not significantly associated with any other clusters or the positive control: KDR and KCTD9.Cluster four, enriched for cardiometabolic comorbidities, was significantly associated with four loci, WNT4, VPS13B (unique to this cluster), CDKN2B-AS1, and RNLS, the strongest hit from the positive control.
RNLS was also significantly associated with cluster three.Five loci were significantly associated with the EHR-asymptomatic cluster, and three of those (EBF1, FAM120B, and ABO) had no other associations, even with the positive control.However, two of those five loci (SYNE1 and CDKN2B-AS1) were significantly associated with a negative control test of the same size.

Discussion
In this study, we aimed to investigate the genetics of heterogeneity in endometriosis by defining datadriven subtypes in women from the non-genotyped PMBB endometriosis population (N=4,078).
Unsupervised clustering and statistical enrichment testing across the features, yielded 5 clusters which were labeled as (1) pain comorbidities, (2) uterine disorders, (3) pregnancy complications, (4) cardiometabolic comorbidities, and (5) EHR-asymptomatic.This nuanced phenotyping, which diverges from traditional classifications, allows for a deeper understanding of the pathophysiological variations within endometriosis and highlights the necessity of tailored therapeutic approaches.
After deriving and characterizing the clusters in the non-genotyped PMBB, we used a knearest neighbors' model to transfer the subtypes to the other five EHR-linked genetic datasets, PMBB, eMERGE, AOU, UKBB, and BioVU.Next, we performed ancestry-stratified candidate gene testing for each of the clusters, a positive control, and negative controls using SAIGE.The positive control (overall endometriosis) replicated one genome-wide significant signal (RNLS), while 15 more surpassed the Bonferroni threshold.At the Bonferroni significance threshold, the association with RNLS was also significant for two out of five sub-phenotypes: pregnancy complications and cardiometabolic comorbidities.RNLS is highly expressed in the heart and contributes to regulating blood pressure 17 .In genetic association studies, RNLS has been previously associated with type 1 diabetes 18 and smoking initiation 19 .Smoking is a known risk factor of endometriosis.
While 62% (41% by the positive control) of previously known GWAS loci were replicated in our study by one or more of the clusters or the positive control, substantial differences in associations across the clusters were observed.For instance, SYNE1 and GREB1 showed specific associations with the uterine disorders cluster, suggesting that these genes might play distinct roles in the pathogenesis of these phenotypic presentations of endometriosis.Conversely, the BSN gene, although not statistically significant, demonstrated greater significance in the pain, uterine disorders, and pregnancy complications clusters, indicating a possible link to neurovascular or inflammatory mechanisms that could exacerbate these conditions.Cluster five, which was largely asymptomatic in the EHR, was the largest cluster, comprising 56% of UKBB endometriosis patients.It is possible that those assigned to this cluster has unrecorded symptoms in the structured data we accessed.Two well-known endometriosis loci, SYNE1 and CDKN2B-AS1 13,14 , were significantly associated with both the positive control and the EHR-asymptomatic cluster.However, these loci were also significantly associated with cluster five's negative control.Additionally, three loci: EBF1, FAM120B, and ABO were uniquely associated with this cluster.These genes have links to several other women's health conditions: EBF1 mRNA have been associated with spontaneous preterm birth 20 , FAM120B 21 and ABO 22 have both been found to be related to ovarian cancer.Notably, endometriosis is known to increase one's risk of ovarian cancer 23 .
Subtyping complex diseases, like endometriosis, is crucial for advancing nuanced precision medicine.The findings from our study underscore the utility of EHR as a rich resource for disease subtyping and genetic research.The linkage of detailed clinical data with genetic information enables the identification of phenotype-genotype correlations that are often diluted in broader GWAS analyses.Furthermore, the use of spectral clustering helps elucidate the heterogeneity within endometriosis, providing a framework for understanding the multifaceted nature of the disease and facilitating the development of personalized medicine.
However, it is essential to acknowledge the limitations of our study.One significant constraint was the sample size, which was particularly limited for some of the smaller clusters and for individuals of non-European ancestry.This limitation could potentially introduce bias and affect the generalizability of our findings.Additionally, our study relies on structured electronic health data only, which may not capture the full clinical picture and could be subject to inaccuracies or incomplete records.Last, this genetic association analyses in this study only focused on the candidate genes that are previously known to be associated with endometriosis.This approach might have restricted our ability to discover novel genetic loci potentially relevant to the specific clusters identified.Despite these limitations, our study marks a meaningful advancement in understanding the genetic factors that may contribute to the heterogeneity observed in endometriosis.By focusing on genetic associations gleaned from electronic health records, we offer a novel perspective that could be instrumental in future research and treatment approaches.In conclusion, our research highlights the importance of subtype-specific studies in elucidating the genetic basis of endometriosis.By leveraging the capabilities of EHR-linked biobanks and employing advanced clustering techniques, we pave the way for more targeted and effective approaches to understanding and managing this complex disease.

Methods
Datasets used for sub-phenotyping and genetic association  32 .Briefly, participants were genotyped using a customized Illumina Multi-Ethnic Genotyping chip, and then individuals or variants with high missingness (>2%) were removed.After further filters for revoked consent status or genetic sex discrepancies, whole-genome variants were imputed on the TOPMED Imputation server 33  clusters 39 .Hierarchical agglomerative clustering is an unsupervised classification method that uses a pairwise distance matrix to iteratively merge nearby points together 40 .K-means clustering randomly initializes centroids for each cluster and then alternates between assigning data points to their nearest centroid and adjusting the centroids until convergence 41 .
In addition to choosing an algorithm, a common struggle with unsupervised clustering is choosing a target number of clusters in a non-arbitrary way.We used several empirical metrics for this: silhouette score, distortion score, and a metric we developed to represent the "evenness" of clusters.The silhouette score is a metric which considers both intra-and inter-cluster distances to assess tightness within a cluster and distance between clusters; higher silhouette scores indicate better quality clusters.The distortion score is the sum of squared errors with respect to the centroid of each cluster, thus it is desired to minimize distortion.Our evenness metric, optimized by minimization, was defined as the fractional difference between the size of the largest and smallest clusters.We measured these metrics across tests for 2-20 clusters for each of the four clustering methods (except for DBSCAN which automatically infers the optimal number of clusters).

Characterization of unsupervised clusters
After identifying clinical clusters within our observation dataset, our objective was to delineate their characteristics.We performed two-population z-score proportion tests 42 ).By considering the cluster-specific differences in these EHR-derived features among the two datasets (training set and chart review set), we observed patterns in clinical presentation.Based on these patterns, we assigned labels to each cluster.

Cluster-stratified candidate gene association testing
To identify genetic heterogeneity among the varied clinical presentations of endometriosis, we performed cluster-stratified, ancestry-stratified candidate gene association studies.Using PLINK 2.0 43 , we extracted single nucleotide polymorphisms (SNPs) in LD (kb distance < 0.5 Mb and R 2 > 0.1) with 39 autosomal lead SNPs reported in the most recent endometriosis GWAS 14 .LD was computed based on the thousand genomes reference panel 36 .Cluster phenotypes were assigned for PMBB, eMERGE, and AOU using a K-Nearest neighbors' classifier 44 with K=3 on the same 17 ICDbased features.For each study, we employed a linear mixed model regression method employed in SAIGE 45 to test for associations between genotypes and case-control status.Cases were females with endometriosis from one cluster and controls were biological females with no ICD history of endometriosis.In the regression models we included the first four principal components, age, and batch indicators (eMERGE only) as covariates.The ancestry-stratified results of these studies were then meta-analyzed using Plink 1.9 46 for each of the cluster-phenotypes.Because multiple genetic ancestry groups were included, we chose a random-effects meta-analysis, which is more robust to heterogeneity 47 .
Using the same association testing methodology, we tested a positive control and negative control.The positive control baseline was overall endometriosis (cases from all clusters combined) to identify how many known loci we were able to replicate.The negative control was performed for each of the clusters to see whether we were powered to detect associations regardless of signal coming from the identified subtypes.Endometriosis cases were randomly assigned to one of the five clusters so that the five meta-analyses for the negative control had the same numbers of cases and controls as in the regular analyses.
Unsupervised clustering was performed in non-genotyped PMBB dataset of 4,078 women with EHR-diagnosed endometriosis using 17 clinical features (S2 Fig).We tested four methods for unsupervised clustering as well as 19 values for the number of clusters (K=2-20) and measured three metrics to empirically choose a clustering method and number of clusters (Figure 1).

Figure 1 .
Figure 1.Testing unsupervised clustering methods.We tested various clustering algorithms and K-values to empirically choose an optimal method.The three metrics shown are (a) Manhattan-

Figure 2 .
Figure 2. Eigenvector pair plot.Shown are pairwise scatter plots of the first five eigenvectors of the affinity matrix used for spectral clustering, colored by cluster.This five-dimensional eigenvector space was used for clustering.The diagonal shows kernel density estimator plots for each of the five eigenvectors.

Figure 3 .
Figure 3. Feature tests for the non-genotyped PMBB training set.Shown are (a) z-scores for the difference in proportion tests, annotated with p-values that are significant and (b) feature prevalence by cluster to provide context for the z-score tests.

Figure 4 .
Figure 4. Feature tests for the chart reviewed PMBB dataset.Shown are (a) z-scores for the difference in proportion tests, annotated with p-values that are significant and (b) feature prevalence by cluster to provide context for the z-score tests.

Figure 6 :
Figure 6: Phenotype-specific association test results.The top panel (a) indicates which known loci were significantly replicated by the positive control and five clusters (filled in dots).The outlineof the dots represents whether the cluster had a stronger association than the positive control.An "X" over the dot means that that locus was found to be significant in the negative control for that cluster.The bottom left panel (b) shows the number and names of statistically significant associations for each phenotype.The bottom center panel (c) shows the number of loci for which each phenotype had a more significant association than the baseline.The bottom right panel (d) for remaining individuals.The EHR data are stored in a synthetic derivative (deidentified mirror) of over 3 million VUMC patient records which are updated regularly 34 .All five of the biobanks mentioned above (PMBB, eMERGE, AOU, UKBB, and BioVU) utilize the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) to represent structured EHR data in a harmonized format 35 .For this study, we utilized women with ICDdiagnosed endometriosis in the non-genotyped PMBB cohort (Nendo = 4,078) as the derivation dataset for the clinical subtypes.For deeper characterization of our subtypes, we performed chart-reviews on 682 randomly selected endometriosis cases from the genotyped PMBB.Then, we meta-analyzed women from the genotyped PMBB (N = 20,697, Nendo = 1,198), six non-pediatric sites within the eMERGE network (N = 51,800, Nendo = 2,243), the AOU research program (N = 108,098, Nendo = 2,126), UKBB (N = 261,824, Nendo = 4,451), and BioVU (N = 34,072, Nendo = 1,097) to form our main genetic analysis test set (N = 466,261, Nendo = 12,350).
to determine if the rates of input conditions were significantly different on a cluster-vs-other-clusters basis.For our training set (the non-genotyped PMBB, Nendo = 4,078), we examined two sets of features for the z-score tests: the 17 input features as well as ICD-based anatomical subtypes of endometriosis including adenomyosis, endometrioma, superficial lesions, and deep lesions (S2 Table).We performed a second set of z-score proportion tests on a chart-reviewed dataset of 682 genotyped PMBB patients with endometriosis ICD codes.The chart-review variables collected by OB-GYN clinical fellows at the University of Pennsylvania included gold standard diagnoses of endometriosis and adenomyosis as well as chart-abstracted symptoms, comorbidities, and surgical phenotypes (S3 Table

Table 1 .
Description of Datasets.Cohort sample size and average age of cases and controls for the datasets used in this analysis.Age was considered the age as of when the EHR data were collected.